
    Contractive De-noising Auto-encoder

    The auto-encoder is a special kind of neural network based on reconstruction. The de-noising auto-encoder (DAE) is an improved auto-encoder that is robust to its input: it first corrupts the original data and then reconstructs the original input by minimizing a reconstruction error function. The contractive auto-encoder (CAE) is another improved auto-encoder that learns robust features by penalizing the Frobenius norm of the Jacobian matrix of the learned feature with respect to the original input. In this paper, we combine the de-noising and contractive auto-encoders and propose a further improved auto-encoder, the contractive de-noising auto-encoder (CDAE), which is robust to both the original input and the learned feature. We stack CDAEs to extract more abstract features and apply an SVM for classification. Experimental results on the benchmark MNIST dataset show that the proposed CDAE performs better than both the DAE and the CAE, proving the effectiveness of our method.
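
    As a concrete illustration of the combined objective described above, here is a minimal numpy sketch of a single-sample CDAE loss: the input is corrupted before encoding (the de-noising part) and the squared Frobenius norm of the encoder Jacobian is penalized (the contractive part). The tied decoder weights, sigmoid units, Gaussian corruption, and the `noise_std` and `lam` values are illustrative assumptions, not the paper's exact setup.

        import numpy as np

        rng = np.random.default_rng(0)

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def cdae_loss(x, W, b, b_rec, noise_std=0.3, lam=0.1):
            # De-noising part: corrupt the input before encoding.
            x_tilde = x + rng.normal(0.0, noise_std, size=x.shape)
            h = sigmoid(W @ x_tilde + b)          # learned feature
            x_hat = sigmoid(W.T @ h + b_rec)      # tied-weight reconstruction
            recon = np.sum((x - x_hat) ** 2)      # error against the *clean* input
            # Contractive part: for a sigmoid encoder the squared Frobenius
            # norm of the Jacobian dh/dx is sum_j (h_j(1-h_j))^2 * sum_i W_ji^2.
            jac_frob = np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))
            return recon + lam * jac_frob

    Stacking, as in the paper, would train one such layer, fix it, and feed its features h into the next layer before the final SVM classifier.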

    Hierarchical multi-stream posterior based speech recognition system

    Abstract. In this paper, we present initial results towards boosting posterior-based speech recognition systems by estimating more informative posteriors using multiple streams of features and taking into account acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). These posteriors are estimated based on the “state gamma posterior” definition (typically used in standard HMM training), extended to the case of multi-stream HMMs. This approach provides a new, principled, theoretical framework for hierarchical estimation/use of posteriors, multi-stream feature combination, and integration of appropriate context and prior knowledge in posterior estimates. In the present work, we used the resulting gamma posteriors as features for a standard HMM/GMM layer. On the OGI Digits database and on a reduced-vocabulary version (1,000 words) of the DARPA Conversational Telephone Speech-to-text (CTS) task, this resulted in significant performance improvements compared to state-of-the-art Tandem systems.
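
    For reference, the standard single-stream “state gamma” posterior that the paper extends to multi-stream HMMs is computed with the forward-backward recursions. A minimal numpy sketch, with per-frame scaling assumed here for numerical stability:

        import numpy as np

        def state_gammas(A, pi, lik):
            # gamma_t(i) = P(q_t = i | x_1..x_T) via forward-backward.
            # A[i, j] = P(q_{t+1}=j | q_t=i), pi[i] = P(q_1=i),
            # lik[t, i] = p(x_t | q_t=i) (per-frame emission likelihoods).
            T, N = lik.shape
            alpha = np.zeros((T, N))
            beta = np.zeros((T, N))
            alpha[0] = pi * lik[0]
            alpha[0] /= alpha[0].sum()            # per-frame scaling
            for t in range(1, T):
                alpha[t] = (alpha[t - 1] @ A) * lik[t]
                alpha[t] /= alpha[t].sum()
            beta[-1] = 1.0
            for t in range(T - 2, -1, -1):
                beta[t] = A @ (lik[t + 1] * beta[t + 1])
                beta[t] /= beta[t].sum()          # same scaling trick
            gamma = alpha * beta
            return gamma / gamma.sum(axis=1, keepdims=True)

    In the paper's setup these gammas (estimated over multiple feature streams) then serve as features for the subsequent HMM/GMM layer.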

    Processing and Linking Audio Events in Large Multimedia Archives: The EU inEvent Project

    In the inEvent EU project [1], we aim at structuring, retrieving, and sharing large archives of networked, and dynamically changing, multimedia recordings, mainly consisting of meetings, videoconferences, and lectures. More specifically, we are developing an integrated system that performs audiovisual processing of multimedia recordings and labels them in terms of interconnected “hyper-events” (a notion inspired by hyper-texts). Each hyper-event is composed of simpler facets, including audio-video recordings and metadata, which are then easier to search, retrieve, and share. In the present paper, we mainly cover the audio processing aspects of the system, including speech recognition, speaker diarization and linking (across recordings), the use of these features for hyper-event indexing and recommendation, and the search portal. We present initial results for feature extraction from lecture recordings using the TED talks. Index Terms: networked multimedia events; audio processing; speech recognition; speaker diarization and linking; multimedia indexing and searching; hyper-events.
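
    One way to read the “hyper-event composed of simpler facets” description is as a linked record. The following dataclass sketch is purely hypothetical; the field names and types are not taken from the project:

        from dataclasses import dataclass, field

        @dataclass
        class Facet:
            kind: str                 # e.g. "audio", "video", "transcript"
            uri: str                  # where the media or metadata lives
            metadata: dict = field(default_factory=dict)

        @dataclass
        class HyperEvent:
            title: str
            facets: list[Facet] = field(default_factory=list)
            # Hyper-text-like links to related hyper-events (across recordings).
            links: list["HyperEvent"] = field(default_factory=list)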

    Towards Lower Error Rates in Phoneme Recognition


    Microphone array post-filter based on noise field coherence


    How does a dictation machine recognize speech?

    There is magic (or is it witchcraft?) in a speech recognizer that transcribes continuous radio speech into text, even with a word accuracy of no more than 50%. The extreme difficulty of this task, though, is usually not perceived by the general public. This is because we are almost deaf to the infinite acoustic variations that accompany the production of vocal sounds, which arise from physiological constraints (co-articulation), but also from the acoustic environment (additive or convolutional noise, Lombard effect), or from the emotional state of the speaker (voice quality, speaking rate, hesitations, etc.). Our consciousness of speech is indeed not stimulated until after it has been processed by our brain to make it appear as a sequence of meaningful units: phonemes and words. In this Chapter we will see how statistical pattern recognition and statistical sequence recognition techniques are currently used to try to mimic this extraordinary faculty of our mind (Section 4.1). We will follow, in Section 4.2, with a MATLAB-based proof of concept of word-based automatic speech recognition (ASR) based on Hidden Markov Models (HMMs), using a bigram model for modeling (syntactic-semantic) language constraints.
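
    As a small companion to the bigram idea mentioned above, the sketch below scores a word sequence under a bigram language model, P(w_1..w_n) = P(w_1) * prod_i P(w_i | w_{i-1}). The toy probability tables are invented for the example; in a real recognizer they would be estimated from corpus counts (with smoothing for unseen bigrams):

        import math

        # Hypothetical probability tables, for illustration only.
        unigram = {"call": 0.4, "home": 0.3, "mom": 0.3}
        bigram = {("call", "home"): 0.5, ("call", "mom"): 0.5}

        def bigram_log_prob(words):
            # log P(w_1..w_n) = log P(w_1) + sum_i log P(w_i | w_{i-1})
            lp = math.log(unigram[words[0]])
            for prev, cur in zip(words, words[1:]):
                lp += math.log(bigram[(prev, cur)])
            return lp

        print(bigram_log_prob(["call", "home"]))  # log(0.4 * 0.5)

    In the HMM-based decoder, this language model score is combined with the acoustic likelihoods to rank candidate word sequences.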

    Representation of Time-Varying Stimuli by a Network Exhibiting Oscillations on a Faster Time Scale

    Sensory processing is associated with gamma frequency oscillations (30–80 Hz) in sensory cortices. This raises the question of whether gamma oscillations can be directly involved in the representation of time-varying stimuli, including stimuli whose time scale is longer than a gamma cycle. We are interested in the ability of the system to reliably distinguish different stimuli while being robust to stimulus variations such as uniform time-warp. We address this issue with a dynamical model of spiking neurons and study the response to an asymmetric sawtooth input current over a range of shape parameters. These parameters describe how fast the input current rises and falls in time. Our network consists of inhibitory and excitatory populations that are sufficient for generating oscillations in the gamma range. The oscillation period is about one-third of the stimulus duration. Embedded in this network is a subpopulation of excitatory cells that respond to the sawtooth stimulus and a subpopulation of cells that respond to an onset cue. The intrinsic gamma oscillations generate a temporally sparse code for the external stimuli. In this code, an excitatory cell may fire a single spike during a gamma cycle, depending on its tuning properties and on the temporal structure of the specific input; the identity of the stimulus is coded by the list of excitatory cells that fire during each cycle. We quantify the properties of this representation in a series of simulations and show that the sparseness of the code makes it robust to uniform warping of the time scale. We find that resetting of the oscillation phase at stimulus onset is important for a reliable representation of the stimulus and that there is a tradeoff between the resolution of the neural representation of the stimulus and robustness to time-warp.

    Author Summary: Sensory processing of time-varying stimuli, such as speech, is associated with high-frequency oscillatory cortical activity, the functional significance of which is still unknown. One possibility is that the oscillations are part of a stimulus-encoding mechanism. Here, we investigate a computational model of such a mechanism, a spiking neuronal network whose intrinsic oscillations interact with external input (waveforms simulating short speech segments in a single acoustic frequency band) to encode stimuli that extend over a time interval longer than the oscillation's period. The network implements a temporally sparse encoding, whose robustness to time warping and neuronal noise we quantify. To our knowledge, this study is the first to demonstrate that a biophysically plausible model of oscillations occurring in the processing of auditory input may generate a representation of signals that span multiple oscillation cycles.

    Funding: National Science Foundation (DMS-0211505); Burroughs Wellcome Fund; U.S. Air Force Office of Scientific Research.
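
    The asymmetric sawtooth stimulus the abstract refers to can be written down in a few lines. In this sketch, the `rise_frac` shape parameter (the fraction of each period the current spends rising) and the amplitude are illustrative stand-ins for the paper's parameter range:

        import numpy as np

        def asymmetric_sawtooth(t, period, rise_frac, amplitude=1.0):
            # Rises linearly for a fraction `rise_frac` of each period and
            # falls linearly for the rest; `rise_frac` controls how fast
            # the input current rises vs. falls in time.
            phase = (t % period) / period
            rising = phase / rise_frac
            falling = (1.0 - phase) / (1.0 - rise_frac)
            return amplitude * np.where(phase < rise_frac, rising, falling)

        # A uniform time-warp of the stimulus is just a rescaling of the
        # period: asymmetric_sawtooth(t, warp * period, rise_frac).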

    Stochastic Modelling: From Pattern Classification to Speech Recognition and Language Translation

    This paper gives an overview of the stochastic modelling approach to machine translation. Starting with the Bayes decision rule, as in pattern classification and speech recognition, we show how the resulting system architecture can be structured into three parts: the language model probability, the string translation model probability, and the search procedure that generates the word sequence in the target language. We discuss the properties of the system components and report results on the translation of spoken dialogues in the VERBMOBIL project. The experience obtained in the VERBMOBIL project, in particular a large-scale end-to-end evaluation, showed that the stochastic modelling approach resulted in significantly lower error rates than three competing translation approaches: the sentence error rate was 29% in comparison with 52% to 62% for the other translation approaches.
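
    The Bayes decision rule the architecture starts from is the usual source-channel formulation. In the customary notation (f_1^J a source sentence of J words, e_1^I a target sentence of I words), the system picks

        % Bayes decision rule: choose the target sentence maximizing the
        % product of the language model and the string translation model.
        \hat{e}_1^{I} = \operatorname*{arg\,max}_{e_1^{I}}
                        \left\{ \Pr(e_1^{I}) \cdot \Pr(f_1^{J} \mid e_1^{I}) \right\}

    where the first factor is the language model probability, the second the string translation model probability, and the maximization itself is carried out by the search procedure, matching the three-part structure described above.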